A Framework for Characterizing Feature Weighting and Selection Methods in Text Classification
Authors
Abstract
Optimizing the performance of classification models often involves feature selection, either to eliminate noise from the feature set or to reduce computational complexity by controlling the dimensionality of the feature space. Refinement of the feature set is typically performed in two steps: scoring and ranking the features, then applying a selection criterion. Empirical studies of feature selection methods are usually limited to identifying the number or percentage of features to retain in order to maximize classification performance. Since no characterization of the feature set beyond its size is considered, we currently have a limited understanding of the relationship between classifier performance and the properties of the selected feature set. This paper presents a framework for characterizing feature weighting methods and selected feature sets, and for exploring how these characteristics account for the performance of a given classifier. We illustrate the use of two feature set statistics: the cumulative information gain of the ranked features and the sparsity of the data representation that results from the selected feature set. We apply a novel approach of synthesizing ranked lists of features that satisfy given cumulative information gain and sparsity constraints. We show how the use of synthesized rankings enables us to investigate the degree to which feature set properties explain the behaviour of a classifier, e.g., a Naïve Bayes classifier, when used in conjunction with different feature weighting schemes.
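The two-step refinement described above (score and rank features, then apply a selection criterion) and the two feature set statistics (cumulative information gain and sparsity) can be sketched in miniature as follows. This is an illustrative sketch, not the paper's implementation: the tiny corpus, the `information_gain` helper, and the top-k selection criterion are assumptions chosen to make the pipeline concrete.

```python
# Sketch of the two-step feature refinement: (1) score and rank features
# by information gain, (2) select the top-k. The corpus, helper names,
# and k are illustrative, not from the paper.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """IG of a binary feature (term presence) with respect to the class labels."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    h = entropy(labels)
    for part in (present, absent):
        if part:
            h -= len(part) / len(labels) * entropy(part)
    return h

# Toy corpus: each document is a set of terms, with a class label.
docs = [{"ball", "goal"}, {"ball", "vote"}, {"vote", "law"}, {"law", "goal"}]
labels = ["sport", "sport", "politics", "politics"]
vocab = sorted(set().union(*docs))

# Step 1: score and rank; Step 2: selection criterion (keep top-k).
ranked = sorted(vocab, key=lambda f: information_gain(docs, labels, f), reverse=True)
top_k = ranked[:2]

# Feature set statistics: cumulative IG of the selected ranking, and the
# sparsity (fraction of zero entries) of the reduced document-term matrix.
cum_ig = sum(information_gain(docs, labels, f) for f in top_k)
sparsity = sum(f not in d for d in docs for f in top_k) / (len(docs) * len(top_k))
```

The point of the framework is that two rankings retaining the same number of features can differ sharply in `cum_ig` and `sparsity`, which is what the synthesized rankings are designed to probe.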
Similar Resources
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC involves several steps, such as text processing, feature extraction, form...
A New Framework for Distributed Multivariate Feature Selection
Feature selection is an important issue in the classification domain. Selecting good features, via maximum relevance to the class label and minimum redundancy among features, improves classification accuracy. However, most current feature selection algorithms work only as centralized methods. In this paper, we propose a distributed version of the mRMR featu...
An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification
In recent years, the production of text documents has grown exponentially, which is why their proper classification is necessary for better access. One of the main problems in classifying text documents is working in a high-dimensional feature space. Feature Selection (FS) is one way to reduce the number of text attributes. Thus, working with a great bulk of the feature spa...
A Novel One Sided Feature Selection Method for Imbalanced Text Classification
Imbalanced data arises in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend toward the majority class and may even treat minority-class data as outliers. Text data is one of t...
Publication date: 2005